Abstract

In this paper, we study the problem of hyperspectral pixel classification based on the recently proposed
architectures for compressive whisk-broom hyperspectral imagers, without the need to reconstruct the complete
data cube. A clear advantage of classification in the compressed domain is its suitability for real-time on-site
processing of the sensed data. Moreover, it is assumed that the training process also takes place in the compressed
domain, thus isolating the classification unit from the recovery unit at the receiver’s side. We show that, perhaps
surprisingly, using distinct measurement matrices for different pixels results in a more accurate learned classifier
and more consistent classification performance, supporting the role of information diversity in learning.

Index Terms 

Hyperspectral imaging, remote sensing, compressive whisk-broom sensing, pixel classification. 


I. Introduction 

Recently, there has been a surge of interest in compressive architectures for hyperspectral imaging and remote
sensing [1]. This is mainly due to the increasing amount of hyperspectral data that is being collected by
high-resolution airborne imagers such as NASA’s AVIRIS¹ and the fact that a large portion of the data is discarded
during compression or during feature mining prior to learning [2]. It has been noted in [3] that many of the proposed
compressive architectures are based on the spatial mixture of pixels across each frame and correspond to physically
costly or impractical operations, while most existing airborne hyperspectral imagers employ scanning methods to
acquire a pixel or a line of pixels at a time. To address this issue, practical designs of compressive whisk-broom
and push-broom cameras were suggested in [3]. In this work, we tackle the problem of hyperspectral pixel
classification based on compressive whisk-broom sensors; i.e., one pixel is measured at a time using an individual
random measurement matrix. Extension of the presented analysis to compressive push-broom cameras is
straightforward.

To set this work apart from existing efforts that have also focused on classification from compressive
hyperspectral data, such as [?], we must mention two issues with the typical indirect approach of applying the
classification algorithms to the recovered data: (a) the sensed data cannot be decoded at the sender’s side (the
airborne device) due to the heavy computational cost of compressive recovery, making on-site classification
infeasible; (b) the number of measurements (per pixel) may not be sufficient for reliable signal recovery. It has
been established that classification in the compressed domain can succeed with far fewer random measurements than
are required for full data recovery [?]. However, the compressive framework of [?] corresponds to using a fixed
projection matrix for all pixels, which limits the measurement diversity that has been promoted by several recent
studies for data recovery and learning [5], [?], [?].

Rather than devising new classification algorithms, this work focuses on the relationship between the camera’s
sensing mechanism, namely the employed random measurement matrix, and the common Support Vector Machine (SVM)
classifier. It must be emphasized that the general problem of classification based on compressive measurements has
been addressed for the case where a fixed measurement matrix is used [?], [?]. Our aim, however, is to study the
impact of measurement diversity on the learned classifier. In particular, we investigate two different sensing
mechanisms that were introduced in [3]:²
2 For more details regarding the physical implementation of compressive whisk-broom sensors, we refer the reader to [3], which illustrates
conceptual schematics of whisk-broom and push-broom cameras.
Fig. 1. FCA-based versus DMD-based sensing (panels: complete data, FCA-sensed data, DMD-sensed data). Here, rows represent pixels and columns represent spectral bands.
FCA-based sensor: A Fixed Coded Aperture (FCA) is used to modulate the dispersed light before it is
collected at the linear sensor array. This case corresponds to using the same fixed measurement matrix for every
pixel and is a low-cost alternative to the DMD system below.

DMD-based sensor: A Digital Micromirror Device (DMD) is used to modulate the incoming light according
to an arbitrary pattern that is changed for each measurement. Unlike the previous case, a DMD adds the option
of sensing each pixel using a different measurement matrix. Both cases are illustrated in Figure 1.
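
As a minimal sketch of the difference between the two sensing modes, the following Python snippet measures a handful of synthetic pixels either with one shared matrix (FCA-like) or with a freshly drawn matrix per pixel (DMD-like). The random ±1 entries, the toy sizes, and all function names are illustrative assumptions, not details of the physical cameras.

```python
import random

def random_sign_matrix(d_prime, d, rng):
    """Return a d_prime x d matrix with independent random +/-1 entries."""
    return [[rng.choice((-1.0, 1.0)) for _ in range(d)] for _ in range(d_prime)]

def measure(phi, x):
    """Compute y = phi x for one pixel spectrum x (a list of d band values)."""
    return [sum(p * v for p, v in zip(row, x)) for row in phi]

def sense_fca(pixels, d_prime, rng):
    """FCA-like sensing: one fixed matrix is shared by every pixel."""
    phi = random_sign_matrix(d_prime, len(pixels[0]), rng)
    return [(phi, measure(phi, x)) for x in pixels]

def sense_dmd(pixels, d_prime, rng):
    """DMD-like sensing: a new matrix is drawn independently for every pixel."""
    sensed = []
    for x in pixels:
        phi = random_sign_matrix(d_prime, len(x), rng)
        sensed.append((phi, measure(phi, x)))
    return sensed

rng = random.Random(0)
pixels = [[rng.random() for _ in range(8)] for _ in range(4)]  # 4 pixels, 8 bands
fca = sense_fca(pixels, 2, rng)   # every entry carries the same matrix object
dmd = sense_dmd(pixels, 2, rng)   # every entry carries its own matrix
```

Each returned pair keeps the matrix next to its measurement vector, because a classifier trained in the compressed domain needs to know which matrix produced each measurement.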

SVM has been shown to be a suitable classifier for hyperspectral data [?]. Specifically, we employ an efficient
linear SVM classifier with the exponential loss function, which gives a smooth approximation to the hinge loss. To
train the classifier in the compressed domain, we must sketch the SVM loss function using the acquired measurements,
for which we employ some of the techniques developed in [9]. Furthermore, given that the sketched loss function
closely approximates the true loss function and that the learning objective function is smooth, the learned
classifier is expected to be close to the ground-truth classifier based on the complete hyperspectral data (which
is unknown). As discussed in [10], recovery of the classifier is of independent importance in some applications.

This paper is organized as follows. In Section II, we present the learning algorithm that takes the compressive
measurements as input and produces a linear pixel classifier in the signal domain. Section III contains the
simulation results and their analysis. We conclude the paper in Section IV.


II. Problem Formulation and the Proposed Framework 
A. Overview of SVM for spectral pixel classification

In a supervised hyperspectral classification task, a subset of pixels is labeled by a specialist who may have
access to side information about the imaged field, for example by being physically present at the field for
measurement. The task of learning is then to use the labeled samples to tune the parameters of the classifier so
that it predicts the pixel labels for a field with similar material compositions. Note that, for subpixel targets,
an extra stage of spectral unmixing is required to separate the different signal sources involved in generating a
pixel’s spectrum [14]. For simplicity, we assume that the pixels are homogeneous (each consists of a single object).

Recall that most classifiers are inherently composed of binary decision rules. Specifically, in multi-categorical
classification, multiple binary classifiers are trained according to either the One-Against-All (OAA) or the
One-Against-One (OAO) scheme, and voting techniques are employed to combine the results [15]. In an OAA-SVM
classification problem, a decision hyperplane is computed between each class and the rest of the training data, while
in an OAO scheme, a hyperplane is learned between each pair of classes. As a consequence, most studies focus on the
canonical binary classification problem. Similarly, our analysis here is presented for the binary classification
problem, which can be extended to multi-categorical classification.
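
To illustrate how pair-wise decisions can be combined in an OAO scheme, here is a minimal Python sketch of majority voting. The function name, the dictionary of pair-wise scoring functions, and the one-dimensional toy classifiers are assumptions made only for this example; the paper does not specify a particular voting rule.

```python
from collections import Counter
from itertools import combinations

def oao_predict(x, classes, pairwise):
    """Predict a label by majority vote over all pair-wise binary classifiers.

    pairwise[(a, b)] is a function that returns a positive score when x looks
    like class a and a non-positive score when it looks like class b.
    """
    votes = Counter()
    for a, b in combinations(classes, 2):
        winner = a if pairwise[(a, b)](x) > 0 else b
        votes[winner] += 1
    # Ties are resolved in favor of the class that received a vote first.
    return votes.most_common(1)[0][0]

# Toy example: three classes separated by thresholds on a scalar feature.
classes = ["asphalt", "meadow", "gravel"]
pairwise = {
    ("asphalt", "meadow"): lambda x: 0.5 - x,
    ("asphalt", "gravel"): lambda x: 1.0 - x,
    ("meadow", "gravel"): lambda x: 1.5 - x,
}
label = oao_predict(1.1, classes, pairwise)
```

With three classes, three binary classifiers are needed; in general an OAO scheme over c classes uses c(c-1)/2 of them.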

In the linear SVM classification problem, we are given a set of training data points (corresponding to hyperspectral
pixels) $x_j \in \mathbb{R}^d$ for $j = 1, 2, \ldots, n$ and the associated labels $z_j \in \{-1, +1\}$. The inferred
class label for $x_j$ is $\mathrm{sign}(x_j^T w - b)$, which depends on the classifier $w \in \mathbb{R}^d$ and the
bias term $b \in \mathbb{R}$. The classifier $w$ is the normal vector to the affine hyperplane that divides the
training data in accordance with their labels. When the training classes are inseparable by an affine hyperplane,
the maximum-margin soft-margin SVM is used, which relies on a loss function to penalize the amount of misfit. For
example, a widely used loss function is $\ell(r) = (\max\{0, 1 - r\})^p$ with $r = z_j(x_j^T w - b)$. For $p = 1$,
this loss function is known as the hinge loss, and for $p = 2$, it is called the squared hinge loss or simply the
quadratic loss. The optimization problem for the soft-margin SVM becomes³

$$(w^*, b^*) = \arg\min_{w, b} \; \frac{1}{n} \sum_{j=1}^{n} \ell\big(z_j (x_j^T w - b)\big) + \frac{\lambda}{2} \|w\|_2^2 \qquad (1)$$

In this paper, we use the smooth exponential loss function, which can be used to approximate the hinge loss while
retaining its margin-maximization properties [11]:

$$\ell(z) = e^{-\gamma z} \qquad (2)$$

where $\gamma$ controls the smoothness. We use $\gamma = 1$.
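
The loss functions and the regularized objective described in this subsection can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the regularization weight `lam` (with the conventional factor of one half on the squared norm) is an assumption of this sketch.

```python
import math

def hinge_loss(r, p=1):
    """(max{0, 1 - r})^p: the hinge loss for p = 1, the squared hinge for p = 2."""
    return max(0.0, 1.0 - r) ** p

def exp_loss(r, gamma=1.0):
    """Smooth exponential loss exp(-gamma * r)."""
    return math.exp(-gamma * r)

def objective(w, b, X, z, lam, loss=exp_loss):
    """Average loss of the margins z_j (x_j^T w - b) plus (lam / 2) ||w||^2."""
    n = len(X)
    data_term = sum(
        loss(zj * (sum(xi * wi for xi, wi in zip(xj, w)) - b))
        for xj, zj in zip(X, z)
    ) / n
    return data_term + 0.5 * lam * sum(wi * wi for wi in w)

# Two toy pixels with two bands each and opposite labels.
X = [[1.0, 2.0], [2.0, 0.5]]
z = [1, -1]
value = objective([0.0, 1.0], 1.0, X, z, lam=0.1)
```

Because the exponential loss is smooth everywhere, an objective of this form can be minimized with plain gradient-based methods, whereas the hinge loss has a kink at r = 1.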


B. SVM in the compressed domain 

Let $y_j = \Phi_j x_j \in \mathbb{R}^{d'}$ denote the low-dimensional measurement vector for pixel $j$, where
$d' < d$ is the size of the photosensor array in the compressive whisk-broom camera [3]. As explained in [12], a DMD
architecture can be used to produce a $\Phi_j$ with random entries in the range $[0, 1]$ or random $\pm 1$ entries,
resulting in a sub-Gaussian measurement matrix that satisfies the isometry conditions with high probability [13].
Recall that the measurement matrix $\Phi_j$ is fixed in an FCA-based architecture, while it can be distinct for
each pixel in a DMD-based architecture.

As noted in [9], the orthogonal projection onto the row-space of $\Phi_j$ can be computed as
$P_j = \Phi_j^T (\Phi_j \Phi_j^T)^{-1} \Phi_j$. Consequently, an (unbiased) estimator for the inner product
$x_j^T w$ (assuming a fixed $x_j$ and $w$) based on the compressive measurements would be
$y_j^T (\Phi_j \Phi_j^T)^{-1} \Phi_j w$. As a result, the soft-margin SVM based on the compressive measurements
can be expressed as:

$$\hat{w}^* = \arg\min_{w} \; \frac{1}{n} \sum_{j=1}^{n} \ell\big(z_j \, y_j^T (\Phi_j \Phi_j^T)^{-1} \Phi_j w\big) + \frac{\lambda}{2} \|w\|_2^2 \qquad (3)$$

(we have omitted the bias term $b$ for simplicity).
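
For the single-measurement case (one row per measurement matrix), the estimator inside the sketched loss reduces to scalar arithmetic, which makes it easy to check numerically. The sketch below is an illustration under that assumption; the helper names and the toy vectors are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def sketched_inner_product(y, phi, w):
    """Estimate x^T w from the scalar y = phi . x when phi is a single row.

    With one row, (phi phi^T)^{-1} is the scalar 1 / ||phi||^2, so the
    estimator y (phi phi^T)^{-1} phi w becomes y * (phi . w) / ||phi||^2.
    """
    return y * dot(phi, w) / dot(phi, phi)

def project(phi, v):
    """Orthogonal projection of v onto the line spanned by the row phi."""
    scale = dot(phi, v) / dot(phi, phi)
    return [scale * p for p in phi]

x = [0.2, 0.5, 0.1]      # pixel spectrum (never observed directly)
w = [1.0, -1.0, 2.0]     # candidate classifier
phi = [1.0, -1.0, 1.0]   # single-row measurement matrix
y = dot(phi, x)          # the only quantity the sensor records
estimate = sketched_inner_product(y, phi, w)
```

The value of `estimate` coincides with both `dot(x, project(phi, w))` and `dot(project(phi, x), w)`, which is the identity that lets the training problem be written in terms of the full-dimensional classifier while only touching the measurements.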

We must note that the formulation in (3) is different from what was suggested in [?] for a fixed measurement
matrix. In particular, we solve for $\hat{w}^*$ in the $d$-dimensional space. Meanwhile, the methodology in [?]
would result in the following optimization problem:

$$\tilde{w}^* = \arg\min_{w} \; \frac{1}{n} \sum_{j=1}^{n} \ell\big(z_j \, y_j^T w\big) + \frac{\lambda}{2} \|w\|_2^2 \qquad (4)$$

which solves for $\tilde{w}^*$ in the low-dimensional column-space of $\Phi$. Also note that, in the case of fixed
measurement matrices, (3) and (4) correspond to the same problem with the relationship
$\hat{w}^* = \Phi^T (\Phi \Phi^T)^{-1} \tilde{w}^*$ (because of the $\ell_2$ regularization term, which zeros the
components of $\hat{w}^*$ that lie in the null-space of $\Phi$). In other words, (3) represents a generalization of
(4) for the case when the measurement matrices are not necessarily the same. This allows us to compare the two
cases of (a) having a fixed measurement matrix and (b) having a distinct measurement matrix for each pixel, which
is the subject of this paper. For simplicity, assume that each $\Phi_j$ consists of a subset of $d'$ rows from a
random orthonormal matrix, or equivalently $\Phi_j \Phi_j^T = I_{d'}$; thus, $P_j = \Phi_j^T \Phi_j$. Also assume
that, in the case of DMD-based sensing, each $\Phi_j$ is generated independently of the other measurement matrices.

Following the recent line of work in the area of randomized optimization, for example [19], we refer to the new
loss $\ell\big(z_j x_j^T \Phi_j^T (\Phi_j \Phi_j^T)^{-1} \Phi_j w\big)$ as the sketch of the loss, or simply the
sketched loss, to distinguish it from the true loss $\ell(z_j x_j^T w)$. Similarly, we refer to $\hat{w}^*$ as the
sketched classifier, as opposed to the ground-truth classifier $w^*$.


3 Discussion: Similar results can be obtained using the dual form. Recent works have shown that the advantages of the dual form can be
obtained in the primal as well [16]. As noted in [16], the primal form converges faster to the optimal parameters $(w^*, b^*)$ than the dual
form. For the purposes of this work, it is more convenient to work with the primal form of SVM, although the analysis can be properly
extended to the dual form.
Fig. 2. Linear SVM classification, depicted for d = 2 for illustration (left: FCA-sensed data; right: DMD-sensed data). Small arrows represent each $\Phi_j \in \mathbb{R}^{1 \times d}$.


Figure 2 depicts the two cases of using a fixed measurement matrix (FCA-sensed data) and distinct measurement
matrices (DMD-sensed data) for training a linear classifier. It is helpful to imagine that, in the sketched problem,
each $x_j$ is multiplied with $P_j w$ (the projection of $w$ onto the row-space of $\Phi_j$) since
$y_j^T \Phi_j w = (P_j x_j)^T w = x_j^T (P_j w)$. As shown in Figure 2 (left) with $P_j = P$ for all $j$, there is a
possibility that $w^*$ would nearly align with the null-space of the random low-rank matrix $P = \Phi^T \Phi$. For
such $P$, no vector $Pw$ may discriminate well between the two classes, ultimately resulting in classification
failure. Figure 2 (right) depicts the case when a distinct measurement matrix is used for each point. When $\Phi_j$
is symmetrically distributed in the space and $n$ is large, there are always some $\Phi_j$'s that nearly align with
$w^*$, whereas other $\Phi_j$'s can be nearly orthogonal to $w^*$ or somewhere between the two extremes. This
intuitive example hints at how measurement diversity pays off by making the optimization process more stable with
respect to variations in the random measurements and the separating hyperplane.
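
The intuition above can be checked numerically in the same two-dimensional, single-measurement setting. The Python sketch below compares one fixed rank-one projector with the average of many independently drawn projectors; the uniform distribution over directions, the sample size, and the function names are assumptions chosen for illustration.

```python
import math
import random

def projector(theta):
    """Rank-one projector P = phi^T phi for the unit row phi = (cos, sin)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c * c, c * s], [c * s, s * s]]

def average_projector(n, rng):
    """Average of n projectors whose directions are drawn uniformly at random."""
    acc = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(n):
        p = projector(rng.uniform(0.0, math.pi))
        for i in range(2):
            for j in range(2):
                acc[i][j] += p[i][j] / n
    return acc

rng = random.Random(0)
fixed = projector(rng.uniform(0.0, math.pi))  # FCA-like: one P for all pixels
mixed = average_projector(20000, rng)         # DMD-like: a new P per pixel
```

The fixed projector is singular, so any component of the classifier in its null-space is invisible to every training sample. The averaged projector approaches one half of the identity (the identity divided by d in general), so over the whole training set no direction is systematically lost.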


III. Simulations 

A. Handling the bias term 

It is not difficult to see that employing a distinct $\Phi_j$ for each data vector $x_j$ necessitates having distinct
values of the bias $b_j$ (one for each $\Phi_j$). Note that in the case of a fixed measurement matrix, i.e. when
$\Phi_j = \Phi$ for all $j$, the bias terms would all be the same and linear SVM works normally, as noted in [?].
However, using a customized bias term for each point would clearly result in overfitting, and the learned classifier
would be of no practical value. Furthermore, the classifier cannot be used for prediction since the bias is unknown
for new input samples. In the following, we address these issues.

First, let $S$ denote a set of $k$ distinct measurement matrices, i.e. $S = \{\Phi^{(1)}, \ldots, \Phi^{(k)}\}$.
Instead of using an arbitrary measurement matrix for each pixel, we draw an entry from $S$ for each pixel. Given
that $n \gg k$, each element of $S$ is expected to be utilized more than once. This allows us to learn the bias for
each outcome of the measurement matrix (without the overfitting issue). Note that $k$ signifies the degree of
measurement diversity: $k = 1$ refers to the least diversity, i.e. using a fixed measurement matrix, and
measurement diversity increases with $k$. The new optimization problem becomes:

$$(\hat{w}^*, b_1^*, \ldots, b_k^*) = \arg\min_{w, b_1, \ldots, b_k} \; \frac{\lambda}{2} \|w\|_2^2 + \frac{1}{n} \sum_{j=1}^{n} \ell\Big(z_j \big(y_j^T (\Phi_j \Phi_j^T)^{-1} \Phi_j w - b_{t_j}\big)\Big), \quad \Phi_j = \Phi^{(t_j)} \qquad (5)$$
Fig. 3. Distributions of the classification accuracy (Asphalt vs. Meadows) for the Pavia University dataset ($d' = 1$). Panels: FCA measurement and DMD measurement; axis label: accuracy.


where $t_j$ randomly (uniformly) maps each $j \in \{1, 2, \ldots, n\}$ to an element of $\{1, 2, \ldots, k\}$. The
overfitting issue can now be restrained by tuning $k$; reducing $k$ results in less overfitting. In our simulations,
we use $k > \lceil d/d' \rceil$ to ensure that $S$ spans $\mathbb{R}^d$ with a probability close to one.

For prediction, the corresponding bias term is selected from the set $\{b_1^*, b_2^*, \ldots, b_k^*\}$.
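
The bookkeeping behind the shared bias terms can be sketched as follows. This is a hedged illustration only: the function names are hypothetical, indices are zero-based for convenience, and the prediction rule assumes measurement matrices with orthonormal rows so that the estimator of the inner product is simply the measurement vector times the matrix-vector product.

```python
import math
import random

def smallest_k(d, d_prime):
    """Smallest integer k that satisfies k > ceil(d / d_prime)."""
    return math.ceil(d / d_prime) + 1

def assign_matrices(n, k, rng):
    """Draw an index t_j uniformly from {0, ..., k-1} for each of n pixels."""
    return [rng.randrange(k) for _ in range(n)]

def predict(y, t, matrices, w, biases):
    """Classify one measured pixel using the bias learned for its matrix.

    matrices[t] is the d' x d matrix that produced y; biases[t] is its bias.
    """
    phi = matrices[t]
    phi_w = [sum(p * wi for p, wi in zip(row, w)) for row in phi]
    score = sum(yi * v for yi, v in zip(y, phi_w)) - biases[t]
    return 1 if score >= 0 else -1

k = smallest_k(103, 3)                         # 103 bands, 3 measurements
t = assign_matrices(1000, k, random.Random(0))

# Toy prediction with two single-row matrices in a two-band space.
matrices = [[[1.0, 0.0]], [[0.0, 1.0]]]
label = predict([1.0], 0, matrices, w=[2.0, -3.0], biases=[0.5, -0.5])
```

Since the number of pixels is much larger than the number of matrices, every matrix index is drawn many times, which is what makes it possible to learn one bias per matrix without overfitting.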

B. Results 

The dataset used in this section is the well-known Pavia University dataset [18], which is available with the
ground-truth labels.⁴ For each experiment, we perform a 2-fold cross-validation with 1000 training and 1000
testing samples. As discussed earlier, multi-categorical SVM classification algorithms typically rely on pair-wise
or One-Against-One (OAO) classification results. Hence, we evaluate the sketched classifier on an OAO basis by
reporting the pair-wise performances in a table. Finally, since the measurement operator is random and subject to
variation in each experiment, we repeat each experiment 1000 times and perform a worst-case analysis of the
results.

Consider the case where a single measurement is made from each pixel, i.e. $d' = 1$ and
$\Phi_j \in \mathbb{R}^{1 \times d}$ is a random vector in the $d$-dimensional spectral space. Clearly, this case
represents an extreme scenario where signal recovery would not be reliable and classification in the compressed
domain becomes crucial, even at the receiver’s side where the computational cost is not of greatest concern. For
performance evaluation, we are interested in two aspects: (a) the prediction accuracy over the test dataset, and
(b) the recovery accuracy of the classifier (with respect to the ground-truth classifier), whose importance has been
discussed in [10].

We define the classification accuracy as the minimum (worst) of the True Positive Rate (sensitivity) and the
True Negative Rate (specificity). Figure 3 shows an instance of the distribution of the classification accuracy for a
pair of classes over 1000 random trials. As can be seen, in the presence of measurement diversity, the classification
results are more consistent (reflected in the low variance of the accuracy). Due to limited space, we only report
the worst-case OAO accuracies (i.e. the minimum pair-wise accuracies among 1000 trials) for the Pavia scene. The
results for the case of one-measurement-per-pixel ($d' = 1$) are shown in Tables I and II. Similarly, the results for
the case of $d' = 3$ (which is equivalent to the sampling rate of a typical RGB color camera) are shown in Tables
III and IV. Note that the employed SVM classifier is linear and would not result in perfect accuracy (i.e. an
accuracy of one) when the classes are not linearly separable. To see this, we have reported the ground-truth
accuracies in Table V.

4 http://www.ehu.eus/ccwintco/

5 The Indian Pines dataset was not included due to the small size of the image, which is not sufficient for a large-scale cross-validation
study.
TABLE I

One FCA measurement per pixel: worst-case classification accuracies (1000 trials) for the Pavia scene.

Classes    Meadow   Gravel   Trees   Soil    Bricks
Asphalt    0.45     0.38     0.42    0.36    0.44
Meadow     -        0.48     0.48    0.41    0.47
Gravel     -        -        0.44    0.44    0.44
Trees      -        -        -       0.42    0.53
Soil       -        -        -       -       0.44


TABLE II

One DMD measurement per pixel: worst-case classification accuracies (1000 trials) for the Pavia scene.

Classes    Meadow   Gravel   Trees   Soil    Bricks
Asphalt    0.71     0.64     0.79    0.60    0.71
Meadow     -        0.72     0.61    0.46    0.73
Gravel     -        -        0.79    0.60    0.44
Trees      -        -        -       0.69    0.79
Soil       -        -        -       -       0.60


To measure the classifier recovery accuracy, we compute the cosine similarity, or equivalently the correlation,
between $\hat{w}^*$ and $w^*$:

$$C(\hat{w}^*, w^*) = \frac{\langle \hat{w}^*, w^* \rangle}{\|\hat{w}^*\|_2 \, \|w^*\|_2}$$


In Tables VI and VII, we have reported the average recovery accuracy for the case of three-measurements-per-pixel
(i.e. $d' = 3$).
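
The two evaluation measures used in this section, namely the worse of sensitivity and specificity for prediction and the cosine similarity for classifier recovery, can be sketched in Python as follows. The function names and the toy inputs are assumptions made for illustration.

```python
import math

def worst_case_rate(y_true, y_pred):
    """Minimum of the true positive rate and the true negative rate.

    Labels are +1 / -1, and both classes must be present in y_true.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    pos = sum(1 for t in y_true if t == 1)
    neg = sum(1 for t in y_true if t == -1)
    return min(tp / pos, tn / neg)

def cosine_similarity(u, v):
    """Inner product of u and v divided by the product of their 2-norms."""
    inner = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return inner / (norm_u * norm_v)

accuracy = worst_case_rate([1, 1, 1, -1, -1], [1, 1, -1, -1, -1])
recovery = cosine_similarity([1.0, 2.0], [2.0, 4.0])
```

Taking the minimum of the two rates prevents a classifier that favors one class from looking accurate, and the cosine similarity ignores the scale of the learned classifier, which does not affect the predicted sign.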


IV. Conclusion 


In the field of ensemble learning, it has been discovered that diversity among the base learners enhances
the overall learning performance [20]. Our aim, in contrast, has been to exploit the diversity that can be efficiently
built into the sensing system. Both the pixel-invariant (measurement without diversity) and the pixel-varying
(measurement with diversity) measurement schemes have been suggested as practical designs for compressive
hyperspectral cameras [3]. The presented analysis indicates that employing a DMD results in more accurate recovery
of the classifier and more stable classification performance compared to the case when an FCA is used. Meanwhile,
for tasks that only concern class prediction (and not the recovery of the classifier), FCA is (on average) a suitable
low-cost alternative to the DMD architecture.


TABLE III

Three FCA measurements per pixel: worst-case classification accuracies (1000 trials) for the Pavia scene.

Classes    Meadow   Gravel   Trees   Soil    Bricks
Asphalt    0.61     0.80     0.94    0.63    0.86
Meadow     -        0.67     0.82    0.50    0.62
Gravel     -        -        0.94    0.62    0.54
Trees      -        -        -       0.89    0.93
Soil       -        -        -       -       0.66


TABLE IV

Three DMD measurements per pixel: worst-case classification accuracies (1000 trials) for the Pavia scene.

Classes    Meadow   Gravel   Trees   Soil    Bricks
Asphalt    0.91     0.76     0.96    0.87    0.84
Meadow     -        0.90     0.82    0.57    0.91
Gravel     -        -        0.95    0.82    0.49
Trees      -        -        -       0.93    0.96
Soil       -        -        -       -       0.80

TABLE V

Ground-truth accuracies for the Pavia scene.

Classes    Meadow   Gravel   Trees   Soil    Bricks
Asphalt    1.00     0.97     0.97    1.00    0.94
Meadow     -        0.99     0.96    0.89    0.99
Gravel     -        -        1.00    1.00    0.86
Trees      -        -        -       0.98    1.00
Soil       -        -        -       -       0.99



TABLE VI

Three FCA measurements per pixel: average recovery accuracy (1000 trials) for the Pavia scene.

Classes    Meadow   Gravel   Trees   Soil    Bricks
Asphalt    0.051    0.055    0.113   0.056   0.048
Meadow     -        0.100    0.033   0.019   0.077
Gravel     -        -        0.122   0.064   0.050
Trees      -        -        -       0.017   0.123
Soil       -        -        -       -       0.031

TABLE VII

Classes    Meadow   Gravel   Trees   Soil    Bricks
Asphalt    ...      ...      ...     ...     ...
Meadow     -        ...      ...     0.140   0.380
Gravel     -        -        0.617   0.272   0.197
Trees      -        -        -       0.102   0.582
Soil       -        -        -       -       0.128





[15] F. Melgani and L. Bruzzone, “Classification of hyperspectral remote sensing images with support vector machines,” IEEE Transactions
on Geoscience and Remote Sensing, vol. 42, no. 8, pp. 1778-1790, August 2004.

[16] O. Chapelle, “Training a support vector machine in the primal,” Neural Computation, vol. 19, no. 5, pp. 1155-1178, 2007.

[17] This dataset was gathered by the AVIRIS sensor over the Indian Pines test site in North-western Indiana and consists of 145 x 145 pixels
and 224 spectral reflectance bands in the wavelength range 0.4 x 10^-6 to 2.5 x 10^-6 meters.

[18] This scene was acquired by the ROSIS sensor during a flight campaign over Pavia, northern Italy. The number of spectral bands is
103 and the spatial resolution is 610 x 610 pixels. Ground-truth consists of 9 classes.

[19] M. Pilanci and M. J. Wainwright, “Randomized sketches of convex programs with sharp guarantees,” arXiv:1404.7203 [cs.IT], April
2014.

[20] B. Waske, S. van der Linden, J. A. Benediktsson, A. Rabe, and P. Hostert, “Sensitivity of support vector machines to random feature
selection in classification of hyperspectral data,” IEEE Transactions on Geoscience and Remote Sensing, vol. 48, pp. 2880-2889, 2010.



